The ARC-AGI-3 benchmark challenges AI systems to match untrained human performance in interactive environments, with no frontier model achieving more than 1% success. The test strips away AI's typical advantages, exposing a gap in reasoning and adaptability.
ServiceNow Research introduces EnterpriseOps-Gym, a high-fidelity benchmark for evaluating agentic planning in realistic enterprise environments. The benchmark targets challenges such as long-horizon planning and access controls.
OpenAI plans to retire the SWE-bench Verified benchmark, citing flaws that undermine its validity as a coding performance measure. The move highlights concerns about memorization in AI model evaluations.